Executive Summary


Telemedicine  literally  means  medicine  at  a  distance.  Presently, 
telemedicine  has  been  defined  as  the  use  of  telecommunications  and 
information  technologies  to  provide  health  care.  This  encompasses  the 
diagnosis,  treatment,  monitoring,  and  education  of  patients  regardless  of  the 
patient,  provider,  or  information  location  (Puskin  et  al.,  1995). 

There have been a number of efforts to use telemedicine to deliver health
care to remote and medically under-served populations over the last 40 years. A
review of the telemedicine programs during this time, however, revealed that
only one major project continued to survive after the withdrawal of external
funding (Hassel, 1995). The reasons for the lack of success of these
telemedicine efforts are not apparent, in large part because few, if any,
rigorous scientific evaluations were done.

The  problem  of  evaluating  telemedicine  applications  has  recently  been 
recognized  and  addressed  by  a  number  of  researchers  and  policy  makers  in  the 
area  (Bashshur,  1995;  Grigsby  et  al.,  1995;  Puskin  et  al.,  1995).  In  particular, 
the  DoD  Telemedicine  Evaluation  Working  Group  (TEWG)  proposed  a 
conceptual  framework  to  guide  the  development  of  methodologies  to  evaluate 
telemedicine  projects  in  the  Department  of  Defense.  The  five  areas  to  be 
evaluated  in  the  TEWG  framework  are  clinical  outcomes,  patient/provider 
satisfaction,  human  factors,  organizational  impact,  and  costs  and  benefits. 

One  of  the  areas  in  the  human  factors  evaluation  that  was  determined  to 
be  important  was  the  assessment  of  workload.  It  has  been  shown  in  other 
areas,  e.g.,  aviation,  that  changes  in  technological  applications  have  resulted  in 
additional  workload  demands  on  the  operator.  This  additional  workload  has 
been  related  to  decrements  in  performance.  It  is  believed  that  a  similar  change 
in  the  behavioral  and  cognitive  workload  of  the  health  care  provider  may  occur 
as  a  result  of  the  additional  requirements  imposed  by  telemedicine  applications. 
This  change  in  workload  may  result  in  an  increase  in  the  number  of  errors 
committed. 

Consequently,  a  review  of  the  cognitive  workload  literature  was  done  to 
identify  the  three  most  promising  workload  metrics  for  possible  use  in  measuring 
changes  in  workload  in  telemedicine  applications.  On  the  basis  of  this  literature 
review,  Whitaker  and  Birkmire-Peters  (See  Appendix  A)  proposed  three 
workload  metrics  as  possible  candidates  for  use  in  evaluating  telemedicine 
applications:  the  Modified  Cooper-Harper  (MCH)  Index,  the  NASA-Task  Load 
Index  (TLX)  and  its  subscales,  and  the  Subjective  Workload  Assessment 
Technique (SWAT). See Appendix A for the complete literature review.

In order to determine the optimal metric, it is necessary to subject these
candidate  measures  to  empirical  verification  and  validation  for  use  in  evaluating 
telemedicine  applications.  In  order  to  maintain  experimental  control,  as  well  as 
for  legal  and  ethical  reasons,  the  original  verification  and  validation  process 
must  be  done  in  a  laboratory  before  use  in  evaluating  actual  telemedicine 
applications.  Therefore,  Whitaker,  Hahus,  and  Birkmire-Peters  (See  Appendix 
B)  developed  a  surrogate  laboratory  task  that  taps  the  same  cognitive  demands 
as  expected  in  telemedicine  applications  and  developed  a  laboratory  protocol  for 
testing  workload  metrics.  The  candidate  workload  metrics  were  empirically 
tested  using  this  protocol.  Details  of  these  processes  and  their  results  are 
described  in  Appendix  B. 

The  work  reported  here  will  serve  as  the  basis  for  further  development  of 
a  methodology  for  evaluating  workload  in  telemedicine  applications.  The 
potential  metrics  need  to  be  verified  and  validated  with  more  appropriate 
populations.  Following  that  work,  it  should  be  possible  to  extend  the  findings  of 
this research to the evaluation of actual telemedicine applications.

Workload Measurement Classification

Cognitive  workload  measures  can  be  classified  into  three  broad  areas: 
physiological,  performance  (primary  task  or  loading  task),  and  subjective 
(Schlegel,  1993). 

Physiological  measures.  The  human  body  responds  both  cognitively  and 
physiologically  to  the  demands  of  its  environment  and  its  tasks.  Physiological 
measures  that  vary  with  cognitive  demands  have  been  tested  as  potential 
metrics  of  those  cognitive  demands  (Wierwille  &  Eggemeier,  1993).  These 
measures  include  eye  blink  rate,  pupil  diameter,  P300  amplitude  and  latency, 
galvanic  skin  response  (GSR),  heart  rate,  heart  rate  variability,  and  certain  blood 
and  urine  fractions  (e.g.,  norepinephrine).  It  is  difficult  to  measure  physiological 
responses  because  of  the  large  number  of  trials  that  must  be  performed  to 
obtain  reliable  measures  and  because  of  the  invasive  nature  of  most  of  the 
measurement technology. Finally, those measures that have been obtained
often do not agree with one another (e.g., a task demand may be reflected in
heart rate variability but not in GSR), and their effects are not consistently
found in the literature.

Performance  measures.  Performance  measures  for  the  primary  task  are 
the  most  direct  indication  of  changing  cognitive  workload  (Crabtree,  Bateman, 
and  Acton,  1984).  When  a  task  requires  primarily  cognitive  effort,  then  changes 
in  that  task's  performance  might  be  thought  to  provide  the  best  indication  of 
changes in the level of cognitive effort. However, this will only prove to be the
case if the performance is sensitive to these changes in workload (Boff &
Lincoln, 1988). Suppose instead that a person can perform a task with low
workload demands without error and without employing the maximum cognitive
resources to complete the task. Then suppose that the demands of the task are
increased; now the worker can continue to perform the task without error only by
expending all his cognitive resources; that is, he has no spare resources, but he
is able to maintain errorless performance. In this case, performance is not a
sensitive indicator of the changes in cognitive workload.

A reasonable question to ask is why one would care about changes in
workload that do not affect the performance of the task of interest. When a task
is completed during testing conditions, we usually find that the operator is rested,
the communication among team members is perfect, the time on task is
limited, and no emergencies occur. In these circumstances, task
performance may not be a sensitive indicator of how close an operator is to
using all available resources. However, whenever any one of these
circumstances is compromised, as they often are during actual operating
conditions, the operator who uses all his cognitive resources to maintain
errorless performance during optimal circumstances will be overloaded and
begin to make errors. In contrast, the operator completing a task with a lower
workload will have an available cognitive reserve to muster in the face of
adverse circumstances. This is the reason that a sensitive measure of workload
may provide a better predictor of operational performance than can tested
performance itself.

One  means  of  improving  the  sensitivity  of  performance  measures  is  to 
add  an  additional  task  that  will  use  all  available  cognitive  resources  even  during 
normal  testing  conditions  (Fisk,  Derrick,  &  Schneider,  1983).  This  procedure 
requires  that  the  operator  complete  two  tasks  concurrently;  one  is  the  task  of 
interest  (the  primary  task)  and  the  second  is  a  loading  task  used  to  push  the 
demands  on  the  operator's  resources,  even  during  the  lightest  primary  task 
workload  conditions.  This  is  known  as  a  dual  task  paradigm.  The  result  is  that 
operator  performance  of  the  combination  of  tasks  demands  all  cognitive 
resources  at  each  level  of  primary  task  workload.  In  this  way,  changes  in  that 
workload  will  be  accurately  reflected  in  changes  in  performance  of  one  or  both  of 
the  concurrent  tasks.  In  effect,  the  loading  task  is  acting  in  much  the  same  way 
that  the  adverse  circumstances  and  emergency  demands  of  the  operational 
setting  affect  cognitive  demands  and  in  turn,  adversely  affect  task  performance. 
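
The logic of the dual task paradigm can be illustrated with a toy capacity
model. The sketch below is not taken from the literature reviewed here. It
assumes, purely for illustration, that the operator has a fixed resource
capacity, that task demands add linearly, and that errors appear only when
total demand exceeds capacity. The function name, the units, and all numeric
values are hypothetical.

```python
def error_score(primary_demand, loading_demand=0.0, capacity=10.0):
    """Toy model: errors equal the demand that exceeds a fixed capacity.

    All quantities are in arbitrary, hypothetical units.
    """
    total_demand = primary_demand + loading_demand
    return max(0.0, total_demand - capacity)


# Three hypothetical levels of primary task workload, all within capacity.
primary_levels = [3.0, 5.0, 7.0]

# Primary task alone: performance stays errorless at every level,
# so it does not reveal that workload is changing.
single_task = [error_score(d) for d in primary_levels]

# With a loading task chosen to use up the spare capacity at the lightest
# level, each increase in primary workload now shows up as errors.
dual_task = [error_score(d, loading_demand=7.0) for d in primary_levels]

print("single task errors:", single_task)
print("dual task errors:  ", dual_task)
```

Under these assumptions the single-task scores are identical across workload
levels, while the dual-task scores rise with primary task demand. This is the
sense in which a loading task can restore the sensitivity of performance
measures.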

Subjective  measures.  Operators  are  capable  of  describing  the  difficulty 
of  a  task.  Various  measurement  instruments  have  been  designed  to  quantify 
these  difficulty  evaluations  (Gopher  &  Donchin,  1986).  These  are  known  as 
subjective  measures  of  workload.  Since  the  cognitive  workload  involved  in  the 
completion  of  many  tasks  is  the  conscious  work  that  occurs  in  working  memory 
(i.e.,  short  term  memory),  the  amount  of  this  workload  is  available  for  analysis  by 
the operators themselves. Hence, numerous publications over the past 20 years
have reported the effectiveness of subjective workload metrics in assessing
cognitive workload. In addition to their sensitivity and inferred reliability, these
measures have face validity and have proved valid when compared with task
performance (Eggemeier, McGhee, & Reid, 1983; Boyd, 1983). They are
relatively inexpensive to collect and are usually nonintrusive on the task itself.
That is, the subjective workload measure can be collected without interfering with
task performance (Eggemeier, Melville, & Crabtree, 1984).

Criteria for Selection


To be useful, any measurement must meet four criteria: reliability, validity,
lack of contamination, and availability. Successful workload metrics should meet
four additional (and not strictly independent) criteria: sensitivity,
nonintrusiveness, diagnosticity, and cost effectiveness. The goal of the present
review is to determine candidate workload measures for the assessment of cognitive
workload in telemedicine applications. Therefore, each candidate class of
measures and each measure itself will be evaluated against these criteria.

The  criteria  are  defined  and  described  in  the  following  section: 

1.  Reliability  is  the  repeatability  of  a  measure.  When  a  measure  is  reliable,  then 
repeated  occasions,  similar  tasks,  or  judges  will  obtain  similar  measurement 
levels.  Without  reliability,  a  measure  cannot  be  sensitive  or  valid.  Therefore, 
finding  validity  and  sensitivity  implies  that  reliability  exists;  however,  it  is  far  better 
to  assess  reliability  directly,  although  this  is  too  seldom  done  in  operational 
settings  (Lysaght  et  al.,  1989). 

2.  Validity  is  the  degree  to  which  a  metric  actually  measures  the  concept  it  is 
intended  to  measure.  For  example,  an  intelligence  test  is  valid  if  it  measures 
abilities  as  opposed  to  measuring  achievement. 

3.  Contamination  occurs  when  a  metric  is  confounded  with  other  influences, 
unrelated  to  the  measurement  of  interest.  For  example,  contamination  in 
workload  measures  would  occur  when  physical  effort  to  complete  the  workload 
assessment  confounds  the  measurement  of  cognitive  workload  for  the  task  per 
se.  Lack  of  contamination  is  important  to  any  satisfactory  metric. 

4.  Availability  indicates  the  ability  to  obtain  the  measurement.  Availability  may  be 
limited  by  access,  funding,  or  intrusiveness  into  the  task  domain  itself. 

5.  Sensitivity  is  the  extent  to  which  changes  in  the  item  to  be  measured  are 
reflected  by  changes  in  the  measuring  instrument.  Lack  of  sensitivity  will 
decrease  both  reliability  and  validity.  An  example  of  an  insensitive  workload 
measure  was  given  earlier  in  the  form  of  some  primary  task  performance 
measurements. 

6. Intrusiveness means the extent to which performance of the primary task is
interrupted by the workload metric. Any concurrent demands for obtaining the
measurement of workload have the potential to intrude on the primary task, but
not all appear to do so. Nonintrusiveness is an important criterion of a useful
workload metric.

7. Diagnosticity refers to the ability of a metric to determine what aspect of the
task is the source of the imposed workload, that is, what operator resource is
most severely taxed (see Polzella & Reid, 1987 and Vidulich & Wickens, 1986,
for contrasting views). If an unacceptably high workload is found, then a
diagnostic metric will pinpoint the cause of that overload.

8.  Cost  must  be  evaluated  against  the  value  obtained  from  knowing  the  workload 
information.  The  relationship  between  the  value  of  the  workload  information 
obtained  and  the  cost  of  obtaining  it  is  the  cost  effectiveness  of  the  metric. 

Application  of  Criteria  to  Workload  Metrics 

These  evaluation  criteria  can  be  applied  to  each  of  the  three  broad 
classifications  of  workload  metrics:  physiological,  performance  (primary  and 
dual),  and  subjective. 

Physiological  measures  have  been  found  to  lack  reliability  during  similar 
test-retest  conditions.  Furthermore,  when  multiple  physiological  measures  are 
obtained,  they  often  do  not  correlate  with  one  another  in  reflecting  changes  in 
cognitive  workload.  Without  reliability,  validity  is  not  possible;  therefore,  the 
question  of  validity  can  only  be  considered  when  a  physiological  measure  has 
been  found  to  be  reliable.  Physiological  measures  are  frequently  contaminated 
by  artifacts  from  other  physiological  activities  (e.g.,  eye  blinks,  breathing,  or 
muscle movements). Although some physiological measures can be obtained
directly, most of those of interest for assessing cognitive workload (e.g., P300
evoked brain potentials) require high-technology equipment to
measure small electrical impulses, separate them from surrounding signals, and
analyze them statistically. These measures have proved sensitive in
some cases, but often they have not. A specific application of P300 in the
measurement of perceptual workload has been found when a secondary
task is used to elicit the P300. In this case, some diagnosticity was found (Gopher &
Donchin, 1986). Finally, the need for equipment attached to the operator results
in a very intrusive and expensive measurement methodology.

Performance  measures  might  be  thought  to  be  reliable  and  valid 
measures  of  cognitive  workload  just  by  their  definition.  This  statement  assumes 
that  performance  is  solely  the  result  of  cognitive  workload.  However,  especially 
when  using  only  a  primary  task,  this  has  not  always  been  the  case.  Employing  a 
second,  loading  task  has  improved  the  sensitivity  of  performance  as  an  indicator 
of  cognitive  workload.  Unfortunately,  the  use  of  dual  task  paradigms  may  result 
in  decrements  in  the  primary  task  or  the  loading  task  or  both,  as  workload 
increases.  This  may  compromise  the  safety  of  the  primary  task  in  an  operational 
setting,  and  even  in  an  experimental  setting,  it  makes  interpretation  of  the  results 
difficult. The only source of contamination that has been reported is the
cross-linking of demands from the two concurrent tasks. Intrusion from the
loading  task  can  be  alleviated  by  careful  selection  of  the  loading  task  itself.  One 
successful  method  has  been  to  develop  imbedded  secondary  tasks  specific  to 
each  type  of  primary  task  being  evaluated.  The  costs  of  obtaining  performance 
measures,  whether  primary  or  loading  task  performance,  are  moderate. 

Reliable  subjective  measures  have  been  developed  (e.g.,  SWAT  and 
NASA-TLX).  This  cannot  be  claimed  for  all  subjective  workload  measures  that 
have  been  employed  (Gopher  &  Donchin,  1986).  Furthermore,  cluster  analyses 
(Derrick,  1983)  have  confirmed  that  these  measures  are  valid  in  assessing  a 
variety  of  the  cognitive  demands  that  impact  workload.  These  measures  can  be 
easily contaminated by experimenter expectations and operator motivation.
Care must be taken to avoid these problems when using subjective workload
measures,  and  the  procedures  for  administering  the  well-developed  metrics  have 
taken  these  precautions.  Standard  metrics  for  assessing  subjective  workload 
have  been  established  for  other  domains  such  as  flight  and  communication,  but 
they  have  not  been  employed  in  telemedicine  applications.  The  sensitivity  of 
some  metrics  has  been  found  to  accurately  reflect  changes  in  cognitive  workload 
demands  (e.g.,  signal  rate,  short  term  memory,  and  auditory  communication 
requirements)  (Eggemeier,  Crabtree,  &  LaPointe,  1983;  Moroney,  Biers,  & 
Eggemeier,  1995).  These  metrics  can  be  collected  after  the  primary  task  is 
completed,  and  hence,  they  are  nonintrusive;  their  cost  is  low.  See  Table  A-1  for 
a  summary  of  this  analysis. 

In  the  initial  analysis  of  the  three  broad  classes  used  to  measure  cognitive 
workload,  the  category  of  subjective  workload  metrics  is  the  most  satisfactory 
when  evaluated  by  these  test  and  evaluation  criteria.  They  meet  the  standards 
of  reliability,  validity,  and  lack  of  contamination.  Several  metrics  have  been 
standardized  and  have  been  tested  in  other  domains.  In  these  domains,  such 
metrics  have  been  found  to  be  sensitive  indicators  of  workload,  as  well  as 
predictors of task performance. In general, subjective metrics are thought to
be global indicators of workload; they are not particularly diagnostic of the source
of an overload. They are the least expensive of all metrics (other than
observing  primary  task  performance  alone).  The  nonintrusive  nature  of 
subjective  workload  measures  is  a  very  important  criterion  for  their  use  in  the 
operational  settings  of  telemedicine  practices. 


Table A-1. Evaluation of Broad Workload Classifications.

                                   Workload Classification
                                 Performance     Performance
Criterion       Physiological    (Primary)       (Loading)       Subjective
Reliability     Poor             Good            Good            Generally good
Validity        Variable         Variable        Good            Good
Contamination   Variable         Variable        Variable        Good
Availability    Poor             Good            Variable        Good
Sensitivity     Variable         Variable        Good            Good
Intrusive       Poor             Good            Variable        Good
Diagnostic      Good (P300)      Poor            Good            Poor
Cost            Poor             Good            Moderate        Good
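
One rough way to read Table A-1 is to count, for each workload class, how
many of the eight criteria received a favorable rating. The sketch below
transcribes the table into a dictionary and performs that count. The simple
tally is an illustrative assumption, not the method used in this review, and
it weights all criteria equally, which the review does not claim. The names
`TABLE_A1` and `favorable_counts` are hypothetical.

```python
# Ratings transcribed from Table A-1. Column order: physiological,
# primary task performance, loading task performance, subjective.
CLASSES = ["Physiological", "Primary", "Loading", "Subjective"]

TABLE_A1 = {
    "Reliability":   ["Poor", "Good", "Good", "Generally good"],
    "Validity":      ["Variable", "Variable", "Good", "Good"],
    "Contamination": ["Variable", "Variable", "Variable", "Good"],
    "Availability":  ["Poor", "Good", "Variable", "Good"],
    "Sensitivity":   ["Variable", "Variable", "Good", "Good"],
    "Intrusive":     ["Poor", "Good", "Variable", "Good"],
    "Diagnostic":    ["Good (P300)", "Poor", "Good", "Poor"],
    "Cost":          ["Poor", "Good", "Moderate", "Good"],
}


def favorable_counts(table):
    """Count the criteria on which each class is rated some form of 'good'."""
    counts = {name: 0 for name in CLASSES}
    for ratings in table.values():
        for name, rating in zip(CLASSES, ratings):
            if "good" in rating.lower():
                counts[name] += 1
    return counts


counts = favorable_counts(TABLE_A1)
for name in CLASSES:
    print(f"{name:<14} {counts[name]} of {len(TABLE_A1)} criteria favorable")
```

On this unweighted count the subjective class is rated favorably on seven of
the eight criteria, with diagnosticity the exception.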

Subjective  Workload  Metrics 

Database. A more detailed analysis of subjective workload metrics was
used to select the most promising candidates for use in telemedicine. This
analysis examined human factors technical and psychology electronic databases
using the terms workload and subjective; cognitive workload and subjective;
mental workload and subjective; plus several specific metric names: overall
workload; OW; SWAT; NASA-TLX; TLX; Cooper Harper; MCH. From the titles
retrieved by this search, three comprehensive reviews, three meta-analyses,
and 44 articles describing experimental results were selected as a comprehensive
information set on which to base our metric selection decisions.

Background. Moray (1982) published a comprehensive review of
subjective mental workload examining the literature from 1968, when cognitive
measures of performance were first beginning to be examined by the human
factors community. He reports that few studies had been published during that
time, but his analysis of those studies is particularly helpful for the present task:
selecting workload metrics for telemedicine applications. His review was
divided into four categories, of which three are relevant to the cognitive
demands of telemedicine procedures: cognitive, manual control, and
time stress tasks.

For  the  analysis  of  cognitive  tasks,  a  global  measure  of  subjective 
workload, such as "On a scale from 1 to 9, how hard is this task?" was found to
correlate  better  than  r  =  0.90  with  task  performance.  This  result  has  tended  to 
be  substantiated  by  experimental  results  in  the  ensuing  decade  when  primary 
task paradigms were tested; however, global subjective workload measures
have been found to dissociate from performance when dual task paradigms or
tasks  requiring  either  overlearned  (automated)  or  complex  responses  are 
employed  (Wickens  &  Yei-Yu,  1983;  Vidulich  &  Wickens,  1986). 

Manual control tasks assessed were all flight control tasks. The primary
assessment tool was a subjective rating scale of handling characteristics called
the Cooper-Harper (CH) scale. The focus of this review was on the
characteristics of the manual control tasks that affected subjective workload.
Both order of control and display-to-response lag were found to increase
subjective workload. This is consistent with the performance literature, which has
found increases in error rates with as little as 250 msec of lag in speech signals.
The upper limit on lag that can be accommodated at all in continuous manual
control tasks is 5 seconds. Furthermore, the requirement to complete concurrent
manual control tasks and the introduction of instability into the control system
also reliably increased workload ratings on the Cooper-Harper scale. A medical
analogue to this manual control task is found in laparoscopic gall bladder surgery
when more than one manipulator must be controlled inside a patient's closed
abdomen. Laparoscopic surgery is analogous to teleproctored surgery
because the surgeon must view the operation indirectly through a display on a
color monitor. If remote transmission produces a lag in the visual display
system, a major source of documented workload will be introduced. The CH
scale has been found to be sensitive to this lag.

Finally,  time  stress  has  been  an  important  driver  of  cognitive  workload  and 
is  a  factor  in  some  medical  procedures  considered  for  telemedicine  intervention 
(e.g.,  surgery,  emergency  room  medicine).  Philipp,  Reiche,  and  Kirchner  (1971) 
found  that  the  workload  for  air  traffic  controllers  who  were  on  duty  for  several 
hours  could  be  assessed  using  a  nine-point  scale  for  two  global  questions:  How 
difficult  is  the  task?  and  How  much  time  stress  is  there?  Objective  measures  of 
information  processed  and  time  pressure  for  communication  were  correlated  with 
the  two  subjective  measures.  These  correlations  were  rs  =  0.69  and  0.56, 
respectively,  indicating  a  significant  relationship  between  the  objective  and  the 
subjective  measures.  These  correlation  levels  are  well  within  the  accepted 
levels  for  measuring  validity. 
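
Validity coefficients of this kind can be computed directly from paired
subjective and objective scores. The sketch below assumes a Pearson
product-moment correlation; the source does not state which coefficient
Philipp, Reiche, and Kirchner used. The data are invented for illustration
and are not drawn from any study cited here.

```python
import math


def pearson_r(xs, ys):
    """Pearson product-moment correlation between two equal-length samples."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two samples of equal length, at least 2 points")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    covariation = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sample")
    return covariation / math.sqrt(var_x * var_y)


# Hypothetical observations: nine-point difficulty ratings and an objective
# count of information items processed in the same work periods.
difficulty_ratings = [2, 3, 5, 4, 7, 8, 6, 9]
information_processed = [11, 14, 13, 18, 21, 20, 25, 24]

r = pearson_r(difficulty_ratings, information_processed)
print(f"r = {r:.2f}")
```

Because ratings on a nine-point scale are arguably ordinal, a rank-based
coefficient would be a reasonable alternative to the Pearson coefficient
assumed here.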

This  background  describes  the  subjective  workload  research  issues  that 
emerged  along  with  a  revival  of  general  interest  in  cognitive  psychology 
approximately  25  years  ago.  Subsequent  interest  in  this  method  of  assessing 
workload  has  resulted  in  a  number  of  tested  subjective  workload  assessment 
techniques.  These  metrics  are  described  next. 

Candidate  measurement  tools.  A  number  of  candidate  metrics  have 
been  developed  and  tested  (see  Lysaght  et  al.,  1989,  and  Boff  &  Lincoln,  1988, 
for  reviews).  The  present  analysis  targets  metrics  that  may  be  of  particular  use 
in the assessment of cognitive workload in telemedicine.

Subjective workload metrics may be divided into two general categories:
rating scales, which provide quantitative measures of subjective workload, and
questionnaires and interviews, which provide qualitative information and lessons
learned. Many measures of subjective workload have been developed solely for
their application to a single study or to a single area. These measures are not
discussed, since only measures that have hopes of generalizing from other
domains are reasonable candidates for evaluating telemedicine applications.
Several rating scales have been subjected to test and evaluation development
and shown to be valid in previous research, and these will be considered. They are
described in the following section:

■ Cooper-Harper Scale (including modified Cooper-Harper) is a widely
used metric which was originally developed for assessing aircraft handling
capabilities. It has been found to be a sensitive indicator of workload for
motor or psychomotor tasks (Wierwille & Connor, 1983). A modified version
called the Modified Cooper-Harper (MCH) has been used successfully to
assess perceptual and cognitive requirements (Wierwille & Casali, 1983).
One factor to consider in using the CH or MCH is that it is a rating scale
which produces only ordinal scale data, thus limiting analysis of statistical
significance to non-parametric tests.

■  NASA-Task  Load  Index  (NASA-TLX)  and  its  subscales  is  a  group  of  six 
scales  reflecting  separate  dimensions  of  workload  and  an  overall  workload 
rating  (Hart  &  Mashkati,  1988).  These  dimensions  include  cognitive  loading 
factors  such  as  time  pressure  and  mental  effort,  as  well  as  physical  factors 
such  as  amount  of  physical  effort.  The  rating  is  a  20-point  scale  which  is 
assumed  to  be  interval.  The  NASA-TLX  has  undergone  extensive  and 
rigorous  theoretical  development  and  evaluation.  Although  the  TLX  has 
been  used  most  extensively  to  evaluate  flight  tasks,  it  has  been  used  to 
assess  workload  in  laboratory  tasks  (e.g.,  short  term  memory,  visual  search, 
and  target  acquisition).  It  has  been  found  to  be  a  valid,  reliable,  and  sensitive 
measure  of  cognitive  workload.  The  TLX  is  preferred  to  the  longer  NASA- 
Bipolar  measure  because  of  the  easier  administration  of  the  TLX  and  the 
failure  to  demonstrate  an  advantage  of  the  Bipolar  version. 

■ Subjective Workload Assessment Technique (SWAT) is a group of three
scales reflecting separate dimensions of workload: time pressure, mental
stress, and effort. SWAT has undergone extensive theoretical development
and has been evaluated in both aviation and non-aviation environments (e.g.,
Eggemeier & Stadler, 1984; Eggleston, 1984; Heffley, 1983; Detro, 1985).
Use of conjoint measurement converts these subscale ratings into a single
workload measure which is interval, instead of ordinal (Nygren, 1991). This
metric has been found to be a sensitive, reliable, and valid measure of
cognitive workload. As currently used, there are only three rating levels for
each dimension (subscale). As the result of current test and evaluation
studies (see Moroney, Biers, & Eggemeier, 1995; Biers & McInerney, 1988),
it may be possible to eliminate the current scaling procedure necessary for
conjoint measurement. This scaling procedure (called a card sort) has limited
the number of levels on each subscale. If the card sort is not used,
increasing the levels on each subscale from three to five may improve
sensitivity and remove floor and ceiling effects.

■ Psychophysical scaling (e.g., magnitude estimation) asks that operators
report the workload imposed by a task in comparison to some other task or
standard. For example, using magnitude estimation, a standard task is
assigned a numerical value and operators are asked to compare a task's
workload to that of the standard task by assigning a numerical value to the
current task. Using paired comparisons, all tasks are paired and the
operator chooses the one of the pair with the higher subjective workload (see
Acton, Crabtree, & Simons, 1983, for an application). The difficulty with this
procedure is that the number of pairs of tasks, n(n-1)/2, increases rapidly as
the number of tasks (n) increases. Equal-appearing intervals
asks operators to assign tasks to categories judged to be of increasing
difficulty. The categories are interval scales. Although extensive work has
been done in the development of psychophysical scaling techniques for
judging laboratory stimuli, little work has been reported from operational
settings or from workload measurement. The potential is there, but it awaits
further work to determine its applicability.

■ Stockholm Scales are the result of early work at the University of Stockholm
in the development of a univariate (non-dimensional) measure of workload.
This measure was validated using items on an intelligence test which
measured spatial ability, reasoning ability, and verbal comprehension. (Note
that all these tasks are processed in conscious, or working, memory and
hence should be readily available for subjective evaluation by the subject.)
The reliability and validity as measured by this evaluation were very high. An
11-point version of this scale was used to assess spare mental capacity in a
dual task paradigm using laboratory tasks. These tasks were either
perceptually demanding (e.g., target acquisition) or demanding of central
processing capacity. In both cases, the Stockholm Scale was found to
correlate well with performance and secondary task measures of spare
capacity (i.e., it was sensitive to changes in the primary task difficulty). The
scale is designed to measure effort as available spare central processing
capacity, not motor or psychomotor control.

Overall workload (OW). Each of the scales described can be used as an
overall workload measure; some (e.g., NASA-TLX and SWAT) also have
subscales that may allow diagnostic analysis of the source of the workload
when overload occurs (Hendy, Hamilton, & Landry, 1993). The initial focus of
a  workload  analysis  is  to  determine  whether  there  is  an  overload  that  must  be 
remedied  before  the  task  can  be  completed  safely  with  a  reasonable  degree 
of  operator  workload.  If  overload  is  found,  further  cognitive  task  analysis  can 
be  used  to  evaluate  the  cause. 

Conclusions


The  three  measurement  scales  that  have  undergone  most  extensive 
theoretical  development  and  are  most  relevant  for  the  present  evaluation  are  the 
Modified  Cooper-Harper  (MCH),  NASA-TLX,  and  SWAT.  Each  has  been  shown 
to  be  a  valid  and  reliable  predictor  of  workload  in  several  fields.  Of  the  three,  the 
MCH  appears  to  be  the  most  likely  to  measure  any  motor  or  psychomotor 
components  of  a  medical  procedure.  The  NASA-TLX  has  had  less  testing 
outside  the  aviation  world  than  has  SWAT,  but  it  has  been  shown  to  correlate 
well  with  SWAT  and  MCH  results  in  those  cases  in  which  two  or  more  of  these 
metrics  have  been  tested  together  (e.g.,  Vidulich  &  Tsang,  1986;  Warr,  Colle,  & 
Reid,  1986;  see  also  Lysaght  et  al.,  1989,  for  a  summary  review).  SWAT  has 
been  found  to  be  a  sensitive  predictor  of  increasing  task  difficulty,  measuring 
increased  workload  before  the  point  that  task  difficulty  leads  to  a  decrement  in 
performance  (Whitaker,  Peters,  &  Garinther,  1989). 

Each  of  these  metrics  has  been  found  to  be  more  or  less  sensitive  to 
changes  in  task  difficulty  depending  upon  the  domain  in  which  they  have  been 
used.  This  domain-specific  aspect  requires  that  comparisons  be  made  among 
these  candidate  measures  to  determine  which  is  the  most  effective  in  evaluating 
cognitive  workload  for  various  telemedicine  procedures. 

Abstract

The work described in this report developed and empirically tested a
research paradigm for selecting the optimal metric for evaluating
cognitive workload during telemedicine applications. This effort included
the development and norming of difficulty levels of a surrogate task in a
controlled experimental protocol, the selection of a spatial abilities test,
the acquisition and testing of the required telecommunication and
recording equipment, and the iterative development and testing of a
research protocol. These processes and their results are described in
detail in this report.

Introduction


Telemedicine, as a communication form, may produce additional
demands on a health care provider. There is a need to measure changes in
workload resulting from such additional demands. The purpose of the research
reported here was to test a paradigm for identifying a metric to assess the
workload in telemedicine applications.

A performance task (puzzle patterns) was developed, and the difficulty
level of each pattern was assessed in an experimental protocol. A measure of
individual differences (the Cognitive Laterality Battery) was obtained and
tested, and a test of the surrogate task and telecommunication equipment was
completed. The results of this effort are described in the present report.

Development  and  Assessment  of  Performance  Task: 

Puzzle  Patterns 

Rationale.  To  maintain  experimental  control,  as  well  as  for  ethical  and 
legal  reasons,  it  was  not  possible  to  use  an  actual  medical  procedure  in  the 
planned  assessment  of  workload  metrics.  Therefore,  an  alternate  task  that 
shared  the  cognitive  demands  of  such  procedures  was  needed.  The  following 
demands  were  considered  to  be  essential  for  a  surrogate  task: 

1. Teamwork: Teamwork between at least two team members is required. In a
telemedicine application, at least one person is located remotely. He or she
communicates with either another health practitioner or a patient at a
distance.

2. Visual-Spatial Requirement: Many telemedicine applications require the
transmission of video images to be evaluated by a remotely located
specialist. Therefore, the task needed to incorporate the visual-spatial
requirements of those telemedicine applications.

3. Communication: A communication component was needed because one way
in which face-to-face (also called co-located) conditions and telemedicine
conditions differ is in the need for one health care practitioner to provide
information to a remotely located health care practitioner via audio-video
channels.

4. Performance Demands: The task needed to place accuracy and time
pressure constraints on the team members so that the outcome would produce
sensitive performance indicators. In this way, it is possible to assess the
correlation between successful task execution and subjective workload.

5. Psychomotor Component: Many medical procedures have a large
psychomotor component. The ultimate goal for the selected surrogate task is
that it will be useful for assessing workload during medical procedures.
Therefore, a task that has a psychomotor component was needed.


Development of a Surrogate Task. Using this rationale as the basis for
selecting a surrogate task, a search for existing normed and validated tasks was
conducted. A potential match was found in the spatial abilities Block Pattern
task of the WAIS-R Intelligence Test (Wechsler, 1981). In this test, a person is
asked to construct a two-color pattern from blocks that matches the pattern
shown on a display card. The WAIS-R contains five four-block patterns and
four nine-block patterns. This task met each of the four cognitive
criteria established for selecting a surrogate task and can be modified to include
a psychomotor component. Furthermore, it has validity in that manipulation of
blocks to form a pattern is used in the training of surgeons for ophthalmic
procedures.

Available Norming Data. The test manual states that, during the
development of the WAIS-R, performance data were obtained to measure the
difficulty of the nine patterns. However, these norming data were not available
from either the research department or the legal department of Psychological
Corporation, despite repeated inquiries. The following information was available:
(a) the earlier patterns are easier than the later patterns, and (b) all four-block
patterns are easier than all nine-block patterns. Therefore, only ordinal scaling
was assumed, and the number of difficulty levels was not known.

Creating additional patterns. The design of the experimental protocol for
the application of this surrogate task would require as many as 54
different puzzle patterns. The WAIS-R provided only nine patterns. Therefore, it
was necessary to develop many additional patterns. These additional patterns
were developed in the following ways:

•  The original pattern was rotated 90° or 180°. A rotation of 30° is
scored as a different pattern in the WAIS-R; therefore, any rotation
greater than 30° should be discriminable.

•  The colors were reversed.

•  A  random  change  was  made  to  one  of  the  original,  rotated,  or  reversed 
patterns  to  generate  a  discriminable  pattern  with  a  similar  appearance. 
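
The three transformations above can be sketched in code. This is a minimal
illustration under a simplifying assumption that is not in the report: each
pattern is treated as a square grid of solid red ("R") or white ("W") cells,
whereas the actual blocks may also show two-color faces. All function names are
hypothetical.

```python
import random


def rotate90(pattern):
    """Rotate a grid (tuple of row tuples) clockwise by 90 degrees."""
    return tuple(zip(*pattern[::-1]))


def rotate180(pattern):
    """Rotate a grid by 180 degrees (two quarter turns)."""
    return rotate90(rotate90(pattern))


def reverse_colors(pattern):
    """Swap the two colors in every cell."""
    swap = {"R": "W", "W": "R"}
    return tuple(tuple(swap[cell] for cell in row) for row in pattern)


def random_change(pattern, rng):
    """Flip the color of one randomly chosen cell."""
    swap = {"R": "W", "W": "R"}
    target = (rng.randrange(len(pattern)), rng.randrange(len(pattern[0])))
    return tuple(
        tuple(swap[cell] if (i, j) == target else cell
              for j, cell in enumerate(row))
        for i, row in enumerate(pattern)
    )


# Hypothetical four-block (2 x 2) pattern, used only for illustration.
original = (("R", "W"),
            ("W", "W"))
variants = {
    "rotated 90": rotate90(original),
    "rotated 180": rotate180(original),
    "color reversed": reverse_colors(original),
    "random change": random_change(original, random.Random(0)),
}
```

A symmetric pattern can be unchanged by a rotation, so in practice each
generated variant would still need to be checked against the existing set to
confirm that it is a distinct pattern.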

Four  of  the  original  WAIS-R  patterns  were  used  and  11  rotated  patterns, 
five  color  reversal  patterns,  and  36  random  alteration  patterns  were  added  to  the 
original  set  to  produce  a  complete  set  of  56  puzzle  patterns.  Each  pattern  was 
assigned  a  letter  code  ranging  from  A  through  ddd  in  random  order.  These  56 
patterns  are  shown  in  Figure  B-1. 

Figure B-1. Puzzle patterns developed for experimental paradigm.

BLOCK PATTERN CODES


4  =  4  block  pattern 
9  =  9  block  pattern 
VE  =  very  easy  pattern 
E  =  easy  pattern 
M  =  moderate  pattern 
H  =  hard (difficult) pattern


W  =  WAIS-R  original  pattern 
C  =  color  reversal  of  a  WAIS-R  pattern 
R  =  rotated  WAIS-R  pattern 
X  =  pattern  created  by  experimenter 
P  =  pattern  used  for  practice  only 


 1.)  4-E-C      13.)  4-E-R       25.)  4-E-X      37.)  9-H-X-P    49.)  9-M-X-P
 2.)  4-E-C      14.)  4-E-X       26.)  4-E-X      38.)  9-H-X      50.)  9-M-X
 3.)  4-VE-X     15.)  4-E-R       27.)  4-VE-X     39.)  9-H-X      51.)  9-M-X
 4.)  4-E-X      16.)  4-E-X-P     28.)  9-M-X      40.)  9-H-X      52.)  9-H-W
 5.)  4-E-X      17.)  4-VE-C      29.)  9-M-X      41.)  9-H-X      53.)  9-M-X
 6.)  4-VE-X     18.)  4-VE-W      30.)  9-M-X      42.)  9-H-X      54.)  9-M-X
 7.)  4-VE-X     19.)  4-E-R       31.)  9-H-R      43.)  9-H-R-P    55.)  9-H-R
 8.)  4-VE-W     20.)  4-VE-R-P    32.)  9-H-R      44.)  9-M-X-P    56.)  9-M-X
 9.)  4-E-X      21.)  4-VE-X-P    33.)  9-H-R      45.)  9-M-W
10.)  4-VE-R     22.)  4-VE-X      34.)  9-M-X      46.)  9-H-X-P
11.)  4-VE-X     23.)  4-VE-R      35.)  9-M-R      47.)  9-H-X
12.)  4-VE-C     24.)  4-E-W       36.)  9-M-X      48.)  9-H-X



Assessing  Pattern  Difficulty.  Numerous  scaling  methods  can  be  used  to 
assess  perceived  task  difficulty.  Two  have  been  found  to  be  sensitive,  reliable, 
valid,  uncontaminated,  and  manageable:  magnitude  estimation  and  rank 
ordering  (Kling  &  Riggs,  1972).  These  two  methods  and  their  application  to  this 
assessment  are  described  next. 

Magnitude  estimation  asks  the  observer  to  assign  a  number  to  each  item 
being  assessed.  This  number  is  to  reflect  the  level  of  the  variable  being 
assessed  (in  this  case,  pattern  difficulty).  A  range  of  possible  magnitudes  is 
given  and  sometimes  an  anchoring  value  is  used,  although  this  anchor  can  lead 
to  distortions.  Magnitude  estimation  can  produce  interval  scaled  data.  In  this 
specific  case,  an  observer  was  shown  a  set  of  cards,  each  showing  one  of  the 
27  four-block  patterns.  The  observer  was  allowed  to  look  at  each  of  the  patterns 
and  to  make  any  comparisons  while  examining  the  set.  Next  the  experimenter 
shuffled  the  cards  and  then  showed  the  cards  one  at  a  time  and  asked  the 
observer  to  assign  a  magnitude  between  one  and  fifty  to  each  card.  The  29 
nine-block  patterns  were  assessed  in  the  same  way  except  that  the  range  of 
magnitudes  was  51  to  100. 

Rank  Ordering  asks  the  observer  to  place  the  items  in  an  order  of 
increasing  value  on  the  variable  being  assessed.  Rank  ordering  produces 
ordinal  scaled  data.  In  this  case,  after  completing  the  magnitude  estimation  task 
for  the  four-block  patterns,  the  experimenter  again  shuffled  the  cards  and  then 
asked  the  observer  to  place  the  cards  in  order  from  the  easiest  to  the  most 
difficult  pattern.  After  completing  the  magnitude  estimation  task  for  the  nine- 
block  patterns,  the  observer  rank  ordered  this  set. 

Results.  Eight  independent  observers  were  asked  to  provide  magnitude 
estimations  and  rank  orderings  of  the  56  puzzle  patterns.  Three  types  of 
statistical  analyses  were  conducted:  correlations,  descriptive  statistics,  and 
regression.  First,  correlations  were  computed  to  assess  the  reliability  of  these 
judgments  within  raters  (comparing  magnitude  estimation  to  ranking)  and 
between raters on each scaling method. See Table B-1 for stem-and-leaf
plots of these three reliability distributions. Mean inter-rater reliability in the
range of r = .80 and above is considered to be satisfactory for testing instruments
(Guilford, 1956).

•  Intra-rater reliability between magnitude estimations and rank orderings was
assessed using Spearman's rho because the rank orderings are ordinal
data. Rho ranged from .88 to .97 with a median of .94.

•  Inter-rater  reliability  for  the  magnitude  estimations  was  assessed  using 
Pearson’s  r.  The  r  ranged  from  .70  to  .97  with  a  mean  of  .88. 

•  Inter-rater reliability for the rank orderings was assessed using Spearman's
rho. The rho ranged from .73 to .96 with a median of .87.
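
The reliability coefficients described above can be computed as sketched
below. This is an illustration only: the report does not say what software was
used, the function names are hypothetical, and the ratings in the usage example
are invented. With eight raters there are 8(8-1)/2 = 28 rater pairs, which
matches the 28 values in each inter-rater distribution.

```python
from itertools import combinations
from math import sqrt


def pearson_r(x, y):
    """Pearson product-moment correlation of two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def average_ranks(values):
    """Ranks from 1..n; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman's rho: Pearson's r computed on the ranks."""
    return pearson_r(average_ranks(x), average_ranks(y))


def pairwise_reliabilities(ratings_by_rater, corr):
    """One coefficient for every pair of raters."""
    return [corr(a, b) for a, b in combinations(ratings_by_rater, 2)]


# Invented magnitude estimations from three raters for five cards.
raters = [[10, 25, 40, 45, 5], [15, 20, 35, 50, 10], [5, 30, 30, 40, 1]]
inter_rater_r = pairwise_reliabilities(raters, pearson_r)
inter_rater_rho = pairwise_reliabilities(raters, spearman_rho)
```

Intra-rater reliability follows the same pattern: `spearman_rho` is applied to
one rater's magnitude estimations and that same rater's rank orderings.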


Table B-1: Stem-and-Leaf Plots of Intra- and Inter-Rater Reliabilities. Stem-and-
leaf plots are a method of displaying frequency distributions in a summary form
while still retaining the individual data values. For example, the individual r
values for the intra-rater reliabilities are .88, .89, .92, .94, .95, .97, .97, .97.


Intra-Rater Reliabilities

.8 | 8 9
.9 | 2 4 5 7 7 7


Inter-Rater Reliabilities (Magnitude Estimations)

.7 | 0 9 9
.8 | 0 0 4 4 4 7 8 8 8 8 8 9 9
.9 | 0 1 1 1 2 3 4 4 5 5 5 7


Inter-Rater Reliabilities (Rank Orderings)

.7 | 3 6 7 8 8
.8 | 0 1 3 4 5 5 5 6 7 7 7 7 8 8 8 9
.9 | 0 0 1 1 2 4 5
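
A stem-and-leaf display of this kind can be generated from the raw
coefficients. The sketch below assumes two-decimal values between 0 and 1, uses
the first decimal digit as the stem and the second as the leaf, and is applied
to the eight intra-rater values quoted in the caption of Table B-1. The function
name is hypothetical.

```python
from collections import defaultdict


def stem_and_leaf(values):
    """Return display rows; assumes 0 <= value < 1 with two decimals."""
    rows = defaultdict(list)
    for value in sorted(values):
        stem, leaf = divmod(round(value * 100), 10)
        rows[stem].append(str(leaf))
    return [f".{stem} | {' '.join(leaves)}"
            for stem, leaves in sorted(rows.items())]


intra_rater = [.88, .89, .92, .94, .95, .97, .97, .97]
for row in stem_and_leaf(intra_rater):
    print(row)
```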


Second,  the  mean  and  standard  deviations  of  the  magnitude  estimation 
for  each  card  were  calculated  to  define  the  pattern’s  difficulty  level.  Magnitude 
estimations  were  used  because  they  are  interval  data,  while  ranks  are  only 
ordinal.  When  two  measures  have  similar  reliabilities,  the  interval  measure 
allows  more  powerful  statistical  manipulations  (e.g.,  mean  instead  of  median). 
Table  B-2  provides  the  individual  magnitude  estimations  from  each  observer  and 
the  mean  magnitude  estimation.  Table  B-3  provides  the  individual  rank 
orderings  and  the  median  rank  ordering. 
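
As a worked example of the descriptive statistics, the sketch below uses the
eight ratings listed for card 5 in Table B-2. It assumes that the STDEV column
is the sample standard deviation (n - 1 in the denominator); the report does
not state this, but that form gives a value that agrees with the tabled 9.80
for this card.

```python
from statistics import mean, stdev

# Ratings for card 5 from the eight observers, as listed in Table B-2.
card5 = [50, 20, 35, 40, 50, 35, 45, 38]

avg_mag = mean(card5)   # 39.125, reported as 39.13 in the table
sd_mag = stdev(card5)   # sample standard deviation (n - 1 denominator)
```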

Table B-2: Magnitude Estimations for Individual Raters and Mean
Estimation for Each Puzzle Pattern Card

CARD  MAG 1  MAG 2  MAG 3  MAG 4  MAG 5  MAG 6  MAG 7  MAG 8  AVG MAG  STDEV
   1     50     40     30     45     25     40     30     45    38.13   8.84
   2      3      7     20     40     40     25     25     25    23.13  13.39
   3     10      3      2     15      5     20     10      1     8.25   6.76
   4     48     45     35     40     45     40     35     43    41.38   4.75
   5     50     20     35     40     50     35     45     38    39.13   9.80
   6      2     25     20     10     35      5     20     20    17.13  10.91
   7      1      1      2      5      1      5      5      7     3.38   2.39
   8     25     10      5     10     15     20     20     27    16.50   7.80
   9     45     25     25     30     50     40     40     28    35.38   9.62

Cards 10 through 50 (individual ratings are not legible in the source copy;
only the mean and standard deviation columns survive):

CARD  AVG MAG  STDEV
  10    15.88  12.53
  11    13.13  11.73
  12    17.25  11.30
  13    32.88   9.88
  14    27.13  13.75
  15    31.00  10.80
  16    19.50  10.52
  17    15.30  10.81
  18    18.00   8.14
  19    24.75  11.40
  20    19.00   7.05
  21    19.13  15.04
  22    11.38   7.21
  23    16.88   9.61
  24    27.38  12.24
  25    23.88  13.04
  26    31.25  10.32
  27     4.75   4.92
  28    73.88  12.22
  29    70.25  11.49
  30    70.38   9.66
  31    82.50   9.32
  32    79.63   8.91
  33    78.75  12.54
  34    73.00  12.95
  35    62.63   8.62
  36    68.00  11.31
  37    77.00  10.34
  38    77.13   8.69
  39    77.75  14.34
  40    79.38  15.84
  41    80.00   8.60
  42    79.63   7.96
  43    77.00  12.39
  44    74.50  12.08
  45    62.88   7.74
  46    76.38   9.84
  47    78.50   7.76
  48    81.63  16.16
  49    74.50  16.55
  50    70.00   9.64

CARD  MAG 1  MAG 2  MAG 3  MAG 4  MAG 5  MAG 6  MAG 7  MAG 8  AVG MAG  STDEV
  51     65     60     70     75     90     65     80     77    72.75   9.74
  52    100     75     85     65     85     85     70     82    80.88  10.84
  53     --     60     80     51     53     75     75     78    69.00  12.20
  54     56     55     75     55     75     70     70     68    65.50   8.77
  55     --     80     90     51     96     75     90     78    78.00  15.52
  56     59     52     75    100     55     75     60     53    66.13  16.43

-- = entry not legible in the source copy.
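
Table B-3: Rank Orderings for Individual Raters and Average
(MDN and Mean) for Each Puzzle Pattern Card

Only two rater columns, covering cards 1 through 34, are legible in the source
copy, and the raters to whom they belong cannot be determined. The remaining
rater columns, the median and mean columns, and cards 35 through 56 are not
recoverable.

CARD  RANK (a)  RANK (b)
   1        20        23
   2        14        22
   3         3         5
   4        25        27
   5        27        25
   6        16         2
   7         2         1
   8        10        18
   9        23        19
  10         7         9
  11         4         7
  12        15         6
  13        22        21
  14        11        20
  15        21        26
  16        17        24
  17         5        16
  18        12        10
  19        18        13
  20         9        17
  21         6         8
  22         8         4
  23        13        11
  24        19        14
  25        24        12
  26        26        15
  27         1         3
  28        39        47
  29        38        38
  30        35        37
  31        49        44
  32        42        48
  33        54        49
  34        56        36
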
Finally, a regression analysis was calculated to provide a visual
representation of the reliability and the mean trend for the assessed difficulty of
these 56 patterns. The analysis regressed the individual magnitude estimations
(as the dependent variable) against the mean magnitude estimation (as the
independent variable). First, the regression was computed on the entire set of 56
cards (including both the four- and nine-block patterns). The linear regression
equation was Y' = 1.64 X + 4.12, with R² = .82, explaining 82% of the
variance (see Figure B-2).
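
For readers who wish to see the quantities involved, the following is a minimal
ordinary least-squares sketch that returns the slope, intercept, R², and the F
statistic for a single predictor, F = R²(n - 2) / (1 - R²). It is not the
authors' procedure; the report does not describe the software or the exact data
layout, and the data in the usage example are hypothetical.

```python
def linear_regression(x, y):
    """Least-squares fit of y on x; returns (slope, intercept, r2, f)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    slope = sxy / sxx
    intercept = my - slope * mx
    ss_total = sum((b - my) ** 2 for b in y)
    ss_resid = sum((b - (intercept + slope * a)) ** 2 for a, b in zip(x, y))
    r2 = 1 - ss_resid / ss_total
    # F test of explained variance for one predictor, with (1, n - 2) df.
    f = r2 * (n - 2) / (1 - r2) if r2 < 1 else float("inf")
    return slope, intercept, r2, f


# Hypothetical data: x = predictor value, y = one observed rating per point.
slope, intercept, r2, f = linear_regression([1, 2, 3, 4], [2, 4, 5, 8])
```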


Figure  B-2.  Regression  equation  and  scatterplot  showing  magnitude  estimates 
for  all  56  cards. 

A portion of the strong correlation reflects the clear separation between
the  assessed  difficulty  of  the  four-block  and  the  nine-block  patterns.  This  is 
consistent  with  the  information  obtained  from  the  WAIS-R  manual  and  with  the 
manner  in  which  magnitude  estimations  were  assigned  (1  to  50  for  four-block 
and  51  to  100  for  nine-block).  Clearly,  two  levels  of  difficulty  exist  in  the  total  set 
of  56  patterns. 

Next,  the  regression  was  calculated  within  each  of  the  two  pattern  sets 
(four-  and  nine-block  patterns): 

•  The regression equation for the four-block patterns was Y' = 1.14 X + 7.24,
with R² = .49, explaining 49% of the variance. The F-test of the
significance of the explained variance (relative to treating the set of
four-block patterns as a single undifferentiated difficulty level) is
F = 182.8, p < .01 (see Figure B-3).


Figure B-3. Regression equation and scatterplot showing magnitude estimates
for the four-block patterns.

•  The regression equation for the nine-block patterns was Y' = .58 X + 52.32,
with R² = .20, explaining 20% of the variance. The F-test of the
significance of the explained variance (relative to treating the set of
nine-block patterns as a single undifferentiated difficulty level) is
F = 59.9, p < .01 (see Figure B-4).


Figure B-4. Regression equation and scatterplot showing magnitude estimates
for the nine-block patterns.

The significant F value for each regression analysis indicates that there is
a  reliable  change  in  the  difficulty  level  within  both  the  four-block  and  the  nine- 
block  patterns  as  well  as  between  the  two  pattern  sets.  These  results  are 
consistent  with  using  the  56  patterns  at  four  separate  levels  of  difficulty.  There 
may  be  more  separable  difficulty  levels,  but  four  will  be  a  practical  number  to  test 
the  workload  metrics  in  the  proposed  research.  A  quartile  split  was  used  to 
define  four  pattern  difficulty  levels:  very  easy,  easy,  moderate,  and  hard. 
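
One way to carry out a quartile split is sketched below. The report does not
say whether the cut points were computed over all 56 patterns or separately
within the four-block and nine-block sets; this sketch assumes a single split
over all mean magnitude estimations supplied to it. The function name and the
example values are hypothetical.

```python
from statistics import quantiles

LEVELS = ("very easy", "easy", "moderate", "hard")


def difficulty_levels(mean_estimates):
    """Assign each pattern one of four levels from quartile cut points."""
    q1, q2, q3 = quantiles(list(mean_estimates.values()), n=4)
    levels = {}
    for pattern, value in mean_estimates.items():
        if value <= q1:
            index = 0
        elif value <= q2:
            index = 1
        elif value <= q3:
            index = 2
        else:
            index = 3
        levels[pattern] = LEVELS[index]
    return levels


# Hypothetical mean magnitude estimations for eight patterns.
example = {f"p{i}": i for i in range(1, 9)}
assigned = difficulty_levels(example)
```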

Assessing  Spatial  Ability: 

Cognitive  Laterality  Battery 

Rationale for Measuring Spatial Ability. Individual differences in subjects'
ability to perform various tasks can cloud the results obtained in
experimental research. Therefore, investigators control for this possibility
either by holding individual difference variables constant or by measuring such
variables and stratifying the sample to allow measurement of their impact. In the
present  study,  two  of  these  variables  might  be  verbal  intelligence  and  spatial 
ability.  Intelligence  within  either  a  sample  of  university  students  or  a  sample  of 
medical  personnel  is  not  likely  to  vary  greatly.  Selection  into  these  populations 
has  already  greatly  restricted  the  range  since  verbal  intelligence  is  highly 
correlated  with  academic  success.  However,  spatial  ability  may  range  widely 
within  either  population  because  it  is  not  so  directly  correlated  with  any  selection 
procedure  for  academic  success.  Hence,  a  measure  of  spatial  ability  was  sought 
with  which  to  stratify  the  subjects  in  our  proposed  experiment.  By  this  means,  it 
would  be  possible  to  determine  whether  spatial  ability  was  a  variable  affecting 
performance  of  the  task  in  general,  or  interacting  with  either  task  difficulty  level  or 
communication  method  (co-location  vs.  telemedicine). 

Selecting  Spatial  Ability  Instrument  (Cognitive  Laterality  Battery).  A 
spatial  ability  test  battery  called  the  Cognitive  Laterality  Battery  has  been 
developed,  validated,  and  normed  by  Gordon  (1987).  The  entire  test  is  a 
cognitive  laterality  battery  intended  to  determine  the  specialized  functioning  in 
each  cerebral  hemisphere.  Of  interest  for  the  present  research  are  the  four 
subscales  (tests)  that  comprise  the  measurement  of  spatial  ability.  These  four 
tests  are  called  localization,  orientation,  form  completion,  and  touching  blocks. 
Each  measures  some  aspect  of  spatial  ability,  and  collectively,  they  provide  a 
reliable  measure  of  this  ability. 

Collect  Battery  Test  Materials.  The  Cognitive  Laterality  Battery  is 
available  commercially  in  a  package  that  includes  all  administration  instructions, 
stimulus  materials  in  the  form  of  slides  and  taped  instructions,  data  sheet 
templates,  and  scoring  instructions/answer  keys.  In  addition,  the  norming  data 
(means,  standard  deviation,  and  frequency  distributions)  for  several  populations 
are  provided. 

Equipment to Administer Cognitive Laterality Battery. To administer the
four  spatial  ability  tests,  the  following  equipment  is  used:  slide  projector  (Kodak 
Carousel  5400)  and  tape  recorder/player  (General  Electric  #3-5622A).  The 
administration  of  the  tests,  including  their  instructions  and  material  distribution, 
requires  approximately  60  minutes;  of  this  time,  30  minutes  is  required  for  actual 
data  collection  (time  spent  viewing  the  stimuli  and  marking  responses).  Subjects 
can  be  tested  in  groups  of  as  many  as  10  people,  depending  upon  the  viewing 
conditions.  It  is  necessary  for  each  subject  to  be  able  to  see  clearly  the  stimulus 
slides  projected  on  a  screen. 

Spatial  Ability  Subscales.  The  four  spatial  ability  subscales  (i.e., 
localization,  orientation,  form  completion,  and  touching  blocks)  are  described 
below: 

•Localization  is  a  test  of  the  observer’s  ability  to  reproduce  the  location  of 
an  x  marked  on  a  projected  slide  by  marking  its  corresponding  location  on 
a  paper  template.  There  are  24  slides. 

•Orientation  is  a  mental  rotation  task.  Observers  view  three  3D  geometric 
figures  and  determine  which  two  figures  are  actually  the  same  object. 
There  are  24  tasks. 

•Form  Completion  consists  of  line  drawings  of  common  figures  with 
portions  of  the  line  segments  erased  (missing).  The  observer’s  task  is  to 
name  the  figure.  There  are  24  figures. 

•Touching  Blocks  shows  a  stack  of  blocks  in  which  some  blocks  are 
numbered.  The  observer’s  task  is  to  count  the  number  of  blocks  touching 
all  the  numbered  blocks.  There  are  six  stacks. 

Code results. The results are scored by referring to the answer key for
each test, except the localization subscale. The localization subscale requires
that the experimenter measure the distance in millimeters between the observer's
response and the target location. This is a time-consuming scoring procedure,
even using the template provided in the test booklet.

Tabulate  results.  The  results  can  be  used  as  subscale  values  so  that  they 
can  be  compared  to  the  adult  norming  values  for  each  subscale  in  the  CLB 
manual.  Alternatively  a  general  spatial  abilities  score  can  be  obtained  by  adding 
all  the  subscale  scores  for  a  given  subject.  The  score  for  the  localization 
subscale  is  an  error  measurement  and  hence  is  negatively  correlated  with  spatial 
ability.  Therefore,  the  actual  localization  score  can  be  subtracted  from  any 
constant  larger  than  the  largest  error  score  in  the  sample.  This  transformed 
score  will  then  be  positively  correlated  with  spatial  ability  and  can  be  added  to 
the  remaining  three  subscale  scores  to  obtain  a  total  spatial  ability  measure  for 
each  subject. 
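
The transformation and summation described above can be sketched as follows.
The subscale keys and the scores are hypothetical, and the constant is taken as
the largest localization error in the sample plus one, which is one choice that
satisfies the "larger than the largest error score" requirement.

```python
def total_spatial_scores(subjects):
    """Sum the four subscales after reversing the localization error score.

    `subjects` maps a subject id to a dict with the (hypothetical) keys
    'localization' (error in mm), 'orientation', 'form_completion',
    and 'touching_blocks'.
    """
    # Any constant above the largest error works; max + 1 is used here.
    constant = max(s["localization"] for s in subjects.values()) + 1
    return {
        subject: (constant - s["localization"])
        + s["orientation"] + s["form_completion"] + s["touching_blocks"]
        for subject, s in subjects.items()
    }


# Hypothetical scores for two subjects.
sample = {
    "A": {"localization": 30, "orientation": 20,
          "form_completion": 18, "touching_blocks": 5},
    "B": {"localization": 40, "orientation": 15,
          "form_completion": 12, "touching_blocks": 4},
}
totals = total_spatial_scores(sample)
```

Because the constant depends on the sample, totals computed this way are
comparable within a sample but not across samples scored with different
constants.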

Application for Proposed Testing. The four spatial ability subscales of the
Cognitive  Laterality  Battery  are  available,  reliable,  validated,  uncontaminated, 
and  manageable  methods  of  measuring  spatial  ability.  The  CLB  is 
recommended  as  a  satisfactory  method  of  stratifying  spatial  ability  among  the 
university  students  who  will  serve  as  subjects  in  the  first  proposed  experiment. 


Developing  Test  Paradigm  Procedures 

The  final  activity  in  completing  testing  of  this  paradigm  was  to  design  and 
test the research protocol itself. A generic workload measure was sought that
would assess the cognitive requirements likely to be found in most
medical procedures. Furthermore, there is a specific interest in targeting the
changes  in  cognitive  workload  that  occur  with  the  introduction  of 
telecommunication  for  those  procedures.  Hence,  a  research  protocol  to  test  the 
interaction  of  three  variables  was  designed.  The  three  variables  are:  type  of 
workload  metric,  task  difficulty,  and  communication  condition.  A  measure  for 
stratifying  subjects  by  spatial  ability  was  included. 

Workload Metric. The selection of three candidate workload measures is
described in Appendix A. These three candidate workload measures are the
Subjective Workload Assessment Technique (SWAT) (Reid & Nygren, 1988), the
NASA-Task Load Index (TLX) (Hart & Mashkati, 1988), and the Modified
Cooper-Harper (MCH) (Boff & Lincoln, 1988).

Task  Difficulty.  A  surrogate  puzzle  pattern  task  was  developed  as 
described  earlier.  The  magnitude  estimations  of  difficulty  were  used  to  produce 
four  separate  levels  of  task  difficulty  which  will  be  used  to  assess  the  sensitivity 
of  the  three  workload  metrics.  Thirteen  patterns  of  each  difficulty  level  were 
designed  and  tested. 

Communication  Condition.  The  two  communication  conditions  are  co- 
location  and  telecommunication.  In  the  co-location  condition,  the  two  team 
members  are  located  in  the  same  room  and  view  the  working  area  directly.  In 
the  telecommunication  condition,  they  are  located  in  separate  rooms  and  have  to 
communicate  via  video  and  audio  communication. 

Spatial Ability. Four types of teams are constructed using the subjects'
scores on the spatial ability subscales of the Cognitive Laterality Battery.
The four types of teams are High:High, Low:High, High:Low, and Low:Low, in
which the first member is the instructor and the second is the builder.
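
The team typing can be illustrated as below. The report does not state how
subjects are divided into High and Low; this sketch assumes a median split on
the total spatial ability score, and the subject identifiers and scores are
hypothetical.

```python
from statistics import median


def classify(scores):
    """Label each subject High or Low by a median split (an assumption)."""
    cut = median(scores.values())
    return {subject: "High" if score > cut else "Low"
            for subject, score in scores.items()}


def team_type(instructor, builder, levels):
    """Team label in instructor:builder order, e.g. 'High:Low'."""
    return f"{levels[instructor]}:{levels[builder]}"


# Hypothetical total spatial ability scores.
scores = {"s1": 54, "s2": 32, "s3": 47, "s4": 40}
levels = classify(scores)
print(team_type("s1", "s2", levels))
```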


Equipment.  To  test  the  feasibility  of  the  anticipated  experiment,  it  was 
necessary  to  determine  the  telecommunication  equipment  that  would  be  used  in 
the experiment. The major video components of this equipment were obtained
as  a  loan  from  the  US  Army  Research  Laboratory  at  Aberdeen  Proving  Ground, 
Maryland.  These  consisted  of  two  video  cameras  (Panasonic  VHS  AG  160 
Proline  camcorder  and  AC  adapter)  and  two  television  monitors  (19-inch  Zenith 
Model  #  L1912W).  Additional  equipment  was  obtained  from  local  sources.  This 
equipment  consisted  of  two  TRC-512,  49  MHz  FM  Radio  Shack  wireless 
transmitter-receivers  (“walkie-talkies”)  to  permit  audio  communication  between 
the  team  members  in  the  telemedicine  condition,  a  RST-84V  Radio  Shack  tripod, 
and  a  25-foot  coaxial  cable  to  connect  the  remote  monitor  to  the  camcorder. 

See Figure B-5 for a diagram of the equipment setup.


Figure B-5. Diagrams of equipment set-up (panels: Monitor Connections and
Camcorder Connections).

Design. To select the best workload metric for use in evaluating
telemedicine  applications,  the  following  mixed  factors  design  with  three 
independent  variables  was  developed.  Spatial  ability  of  teams  is  varied  at  four 
levels:  high/high,  high/low,  low/high,  and  low/low.  The  remaining  variables  are 
both  repeated  measures:  communication  condition  (co-located  versus 
telecommunication)  and  four  levels  of  puzzle  pattern  difficulty  (very  easy,  easy, 
moderate,  and  hard).  Each  level  of  puzzle  difficulty  occurs  on  a  total  of  12  trials. 
Half  of  these  are  in  the  co-located  and  half  in  the  telecommunication  condition. 

In  each  communication  condition,  workload  for  two  of  the  trials  is  assessed  using 
each  of  the  three  workload  metrics  (SWAT,  NASA-TLX,  and  MCH).  Thus,  a  total 
of  48  trials  (puzzle  patterns)  are  completed  by  each  team.  A  diagram  of  this 
mixed  factors  design  is  as  follows:  4  spatial  ability  x  (2  communication  x  4  task 
difficulty  x  3  workload  metrics  x  2  replications).  All  levels  of  the  repeated 
measures  variables  will  be  counterbalanced  or  randomized  to  avoid  confounding 
order  with  experimental  treatment  results. 
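
The trial counts stated above can be checked by enumerating the repeated-measures blocks. The following Python sketch is illustrative only: all names are hypothetical, and shuffling the 24 blocks with a seeded generator is an assumption standing in for the counterbalancing scheme, which is not specified here.

```python
import itertools
import random

# Factor levels taken from the design description.
SPATIAL_ABILITY = ["high/high", "high/low", "low/high", "low/low"]  # between-teams factor
COMMUNICATION = ["co-located", "telecommunication"]                 # repeated measure
DIFFICULTY = ["very easy", "easy", "moderate", "hard"]              # repeated measure
METRICS = ["SWAT", "NASA-TLX", "MCH"]                               # one rating per block
REPLICATIONS = 2                                                    # trials per block


def build_blocks():
    """Return one block per communication x difficulty x metric cell."""
    return [
        {"communication": c, "difficulty": d, "metric": m, "trials": REPLICATIONS}
        for c, d, m in itertools.product(COMMUNICATION, DIFFICULTY, METRICS)
    ]


def randomized_order(blocks, seed):
    """Shuffle block order reproducibly (assumed scheme, not the report's)."""
    rng = random.Random(seed)
    order = list(blocks)
    rng.shuffle(order)
    return order


blocks = build_blocks()
total_trials = sum(b["trials"] for b in blocks)
trials_per_difficulty = {
    d: sum(b["trials"] for b in blocks if b["difficulty"] == d) for d in DIFFICULTY
}
trials_per_condition = {
    c: sum(b["trials"] for b in blocks if b["communication"] == c) for c in COMMUNICATION
}

if __name__ == "__main__":
    print(len(blocks), "blocks,", total_trials, "trials per team")
    print(trials_per_difficulty)
    print(trials_per_condition)
```

Under these assumptions each team completes 24 two-trial blocks, which gives one workload rating per block and agrees with the 48 trials and 12 trials per difficulty level described above.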

Test  procedure  and  modify  iteratively.  The  actual  procedure  for  the 
experimental  paradigm  required  modification  from  its  conceptualization  to  its  final 
form. This iteration was accomplished by the principal investigator and the
research assistant alternately serving as experimenter and subject, or both as
team members, until the procedure, instructions, equipment, training, and
measurement issues had been satisfactorily developed. The following
parameters  were  established  empirically  during  these  iterative  modifications: 

•  Number  of  training  trials 

•  Preview  time  for  patterns 

•  Audio  communication  equipment 

•  Field  of  view  and  camera  angle 

•  Permissible  puzzle  patterns  constrained  by  video  view 

•  Instructions  to  team  members 

•  Method  of  recording  errors  (sketch) 

• Anticipated number of errors (which influenced the design of the dependent variables)

•  Time  allowed  for  pattern  building 

• Power for obtaining workload measures (by two-trial blocks, not for each
trial)

Procedure.  Subjects  are  introduced  to  the  experimental  room  and  the 
communication  equipment.  They  are  told  that  their  task  is  to  work  together  as 
teams  to  build  a  series  of  puzzle  patterns  from  blocks.  Before  data  collection 
begins,  each  team  completes  seven  practice  trials  in  which  they  become  familiar 
with  one  another’s  terminology  and  typical  strategies. 

In the telecommunication condition, one team member, serving as the
instructor, sits in the room with the television monitor (Room I) and the other,
serving as the builder, in the room with the camcorder (Room B). The instructor
has a stack of 24 patterns. The instructor’s task is to describe how to build a
given pattern. The builder has the blocks on the table. The builder’s task is to
build the pattern that is described. Both subjects view the two patterns for a
given condition (e.g., Moderate Difficulty, Telecommunication, MCH) for 10
seconds. After the preview, the instructor is the only one to see the paper
pattern. Timing begins when the experimenter says “Begin” for each trial and
ends when the instructor signals completion. After the team completes both
trials in a given condition, a workload rating is obtained.

A  similar  procedure  is  used  for  the  co-location  condition  except  that  the 
two  team  members  are  in  the  same  room.  Again,  the  builder  is  the  only  team 
member  allowed  to  touch  the  blocks  and  the  instructor  is  the  only  one  to  see  the 
paper  pattern.  Time  to  completion,  errors  in  pattern  built  (including  the  sketch  of 
any  incorrect  result),  and  workload  ratings  are  recorded  as  the  dependent 
variables. 

The  instructions  for  the  Instructor  and  the  Builder  team  members  in  both 
the  Telecommunication  and  the  Co-location  conditions  are  found  in  Table  B-4. 

Complete  protocol  with  a  sample  team.  After  completing  the  procedural 
modifications,  the  entire  protocol  (omitting  the  CLB)  was  completed  using  two 
graduate  students  as  subjects.  The  procedure  required  2  hours  to  complete  all  7 
training  trials  and  48  data  collection  trials.  The  results  for  this  team  are 
summarized  in  Table  B-5. 

Table  B-4.  Instructions  for  builder  and  instructor  in  telecommunication  and 
co-location  conditions. 


Instructions  for  Block  Design 


Telecommunication  Condition: 

Instructions  for  the  builder:  In  this  portion  of  the  study  you  will  receive  a 
set  of  instructions  given  to  you  by  your  team  mate  located  in  another  room.  You  will  hear 
these  instructions  over  your  walkie-talkie.  You  will  place  these  red  and  white  blocks  as 
you  are  told  to  form  one  of  the  three  patterns  you  have  viewed.  The  blocks  consist  of  two 
red  sides,  two  white  sides,  and  two  sides  split  in  half  so  that  they  are  both  red  and  white. 
This  camera  is  here  so  that  your  team  mate  may  monitor  your  progress  and  correct  any 
mistakes  you  may  make.  Some  patterns  will  seem  harder  than  others.  After  completing 
three  patterns,  you  will  be  asked  to  fill  out  a  form  that  describes  the  amount  of  work  you 
think  was  involved  in  completing  the  previously  built  patterns.  This  is  a  subjective  measure 
and  will  not  be  the  same  for  all  people  so  do  not  feel  as  though  your  ratings  must  meet  a 
set  standard.  After  you  have  completed  the  measure  of  workload,  you  will  build  three 
more  patterns  and  fill  out  another  workload  evaluation  and  so  on  until  all  patterns  are 
completed  (there  are  twenty  seven).  Your  goal  is  to  work  as  quickly  as  possible  while 
attempting  to  build  a  completely  correct  pattern.  Your  team  will  receive  a  twenty  five 
dollar  reward  if  it  is  one  of  the  two  fastest  teams  with  the  fewest  errors.  You  may  ask  your 
team  mate  to  repeat  any  instructions  you  do  not  understand  by  depressing  the  talk  button 
on  your  own  walkie-talkie.  Are  there  any  questions? 

Instructions  for  the  person  with  the  patterns:  In  this  portion  of  the  study 
you  will  be  asked  to  describe  these  patterns  you  see  before  you  now  to  your  team  mate 
located  in  another  room.  Your  team  mate  has  a  set  of  blocks  in  order  to  achieve  this 
construction  which  have  two  red  sides,  two  white  sides,  and  two  sides  that  are  split  in  half 
so  that  they  are  both  red  and  white.  You  will  communicate  to  your  team  mate  via  a  set  of 
walkie-talkies  one  of  which  you  see  before  you.  You  talk  by  depressing  the  talk  button  for 
the duration of the time you need to speak. Your team mate has the option of asking you
to repeat any instructions s/he does not understand. Keep in mind your team mate has
viewed  the  patterns  you  are  describing  for  thirty  seconds  for  nine  block  patterns  and 
fifteen  seconds  for  four  block  patterns.  The  television  monitor  is  here  so  that  you  may 
monitor  your  team  mate’s  progress  and  correct  any  errors  s/he  may  make.  Once  you  have 
explained  three  patterns  you  will  be  asked  to  fill  out  a  workload  evaluation  which  will  let 
the  experimenter  know  how  much  work  you  believe  was  involved  in  completing  this  phase 
of  the  experiment.  When  this  evaluation  is  completed,  you  will  describe  three  more 
patterns  and  receive  another  evaluation  and  so  on  until  all  patterns  are  completed  (there 
are twenty seven). Workload evaluations are subjective; therefore, your opinions may or
may not match someone else’s. Do not worry; you are not trying to meet a standard. Just
state your own opinion. Your goal is to complete these patterns as quickly as possible
making as few errors as possible. At the end of the experiment, the two teams with the
fastest times and the fewest errors will receive a twenty five dollar bonus. Are there any
questions?

Co-location Condition:

Instructions  for  both  subjects:  In  this  phase  of  the  study  you  will  be  asked 
to  construct  the  patterns  you  see  before  you.  Only  one  of  you  will  have  access  to  the 
patterns  while  the  other  will  have  the  blocks.  However,  you  both  will  be  permitted  to  view 
the  three  patterns  occurring  in  the  ensuing  block.  If  the  patterns  contain  nine  blocks,  you 
will  be  allowed  to  view  them  for  thirty  seconds  and  if  there  are  four  blocks  you  may  view 
them  for  fifteen  seconds.  Only  one  designated  person  may  touch  the  blocks.  The  person 
with  the  designs  must  describe  to  the  other  person  how  to  situate  the  blocks  in  order  to 
create  the  pattern  s/he  sees.  Each  block  consists  of  two  red  sides,  two  white  sides,  and 
two sides split in half so that they are both red and white. The builder may at any time ask
the instructor to repeat instructions that were not understood; however, the builder may
not ask to see the design itself, nor may the instructor show the design to his or her team
mate.  After  completing  three  designs,  you  will  both  be  asked  to  fill  out  a  workload 
evaluation  which  will  tell  the  experimenter  how  much  work  you  each  feel  was  involved  in 
completing  this  phase  of  the  experiment.  These  evaluations  are  subjective  so  the 
evaluations you both fill out may not reflect the same ideas. Do not worry about matching
your partner’s evaluation; the experimenter wants to know what each of your personal
views is. When this evaluation is completed, you will be asked to complete three more
patterns  and  give  another  evaluation  and  so  on  until  all  patterns  are  completed  (there  are 
twenty  seven).  Your  goal  is  to  complete  the  patterns  as  quickly  as  possible  while  making 
as  few  errors  as  possible.  At  the  end  of  the  experiment,  the  two  teams  with  the  fastest 
times  and  the  fewest  errors  will  receive  a  twenty  five  dollar  reward.  Are  there  any 
questions?

Table B-5. Sample of One Team’s Performance of Experimental Protocol.

Co-location

Pattern Difficulty   Very Easy   Easy    Moderate   Hard
Time (in sec.)       9           9.8     24         25.8
SWAT                 33          33      72         72
NASA-TLX             Lost: Experimenter error
MCH                  30          20      50         60

Telecommunication

Pattern Difficulty   Very Easy   Easy    Moderate   Hard
Time (in sec.)       15.8        27.67   39.17      35.17
SWAT                 33          44      61         61
NASA-TLX             Lost: Experimenter error
MCH                  20          30      30         70

NOTE: WL adjusted to 0 to 100 range

On the basis of these data, two further changes were made in the protocol:

•  The  experimenter’s  procedure  checklist  was  changed  to  make  it  easier  to 
collect  the  results  without  the  errors  that  led  to  the  loss  of  the  NASA-TLX  data 
in  the  sample  run. 

•  Debriefing  questions  were  added  to  collect  information  systematically  about 
the  subject’s  preferences  for  one  or  another  of  the  workload  measures. 
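
As an illustration of how the sample team's completion times might be summarized, the following Python sketch computes the mean time in each communication condition and the time difference at each difficulty level, using only the times reported in Table B-5. The names are hypothetical, and with a single team the figures are descriptive only, not a test of any hypothesis.

```python
from statistics import mean

DIFFICULTY = ["very easy", "easy", "moderate", "hard"]

# Completion times in seconds, copied from Table B-5.
TIME_SEC = {
    "co-location": [9.0, 9.8, 24.0, 25.8],
    "telecommunication": [15.8, 27.67, 39.17, 35.17],
}

# Mean completion time per communication condition.
mean_time = {condition: mean(times) for condition, times in TIME_SEC.items()}

# Extra seconds taken in the telecommunication condition at each difficulty level.
time_difference = {
    level: round(tele - colo, 2)
    for level, colo, tele in zip(
        DIFFICULTY, TIME_SEC["co-location"], TIME_SEC["telecommunication"]
    )
}

if __name__ == "__main__":
    for condition, value in mean_time.items():
        print(f"{condition}: mean time {value:.2f} s")
    for level, diff in time_difference.items():
        print(f"{level}: telecommunication slower by {diff:.2f} s")
```

For this one team, the telecommunication trials took longer than the co-location trials at every difficulty level.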

Conclusions 

The  work  described  in  this  paper  was  undertaken  to  establish  a  research 
paradigm  for  developing  a  satisfactory  evaluation  tool  for  telemedicine 
applications.  These  efforts  were  successful  in  establishing  the  feasibility  of  that 
research.  A  surrogate  task  (team  building  of  block  patterns)  was  developed  and 
56 patterns of measured difficulty were designed, produced, and tested. This
task can be used as it is, or it can be modified to incorporate a greater
psychomotor component when an experiment requires one. The telecommunication
equipment necessary to conduct the research was acquired, set up, and tested.
The possibility of uncontrolled individual differences in spatial ability was
considered for some populations, and a measure of spatial ability was identified
so that teams can be stratified on this measure. A scientifically sound research
design and a procedure for implementing that design were developed. The entire
procedure was tested and final adjustments were made.